
Model Selection

One standard approach is Subset Selection: we identify a subset of the $p$ predictors that we believe to be related to the response.

Best Subset Selection (compares $2^p$ models in total; a code sketch follows the steps):

  1. Let $M_0$ be the null model (no predictors $\implies M_0$ predicts only $\hat\beta_0$, the sample mean of $y$)
  2. For $M_i$, $i = 1, 2, \ldots, p$ (where $p$ is the total number of available features):
    1. Fit all ${p \choose i}$ models that contain exactly $i$ predictors, giving ${p \choose i}$ candidates for $M_i$
    2. Pick the one with the smallest RSS / largest $R^2$ and fix it as $M_i$
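A minimal NumPy sketch of this procedure, assuming a design matrix `X` of shape `(n, p)` and response vector `y`; the `rss` helper and all names are illustrative, not from the notes:

```python
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return resid @ resid

def best_subset(X, y):
    """Return the best model M_i (as a tuple of column indices) for each size i."""
    n, p = X.shape
    best = {0: ()}                          # M_0: null model, intercept only
    for i in range(1, p + 1):
        # fit all (p choose i) models with exactly i predictors, keep the smallest RSS
        candidates = itertools.combinations(range(p), i)
        best[i] = min(candidates, key=lambda cols: rss(X[:, list(cols)], y))
    return best
```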

Stepwise Subset Selection:

Forward Stepwise (compares $1 + \frac{p(p+1)}{2}$ models in total; sketched in code after the steps):

  1. Let $M_0$ be the null model
  2. For $M_i$, $i = 0, 1, 2, \ldots, p - 1$:
    1. Consider all $p - i$ models that augment the predictors in $M_i$ with one additional predictor
    2. Pick the one with the smallest RSS / largest $R^2$ and fix it as $M_{i+1}$
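An illustrative sketch of forward stepwise selection, reusing the `rss` helper from the best-subset sketch above; it returns the path $M_0, M_1, \ldots, M_p$ as tuples of column indices:

```python
def forward_stepwise(X, y):
    """Greedily add one predictor at a time, keeping the best RSS at each step."""
    n, p = X.shape
    selected = []                           # M_0: no predictors
    path = [tuple(selected)]
    for i in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # try each of the p - i remaining predictors and keep the best addition
        best_j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best_j)
        path.append(tuple(selected))        # M_{i+1}
    return path
```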

Backward Stepwise (compares $1 + \frac{p(p+1)}{2}$ models in total; sketched in code after the steps):

  1. Let $M_p$ be the full model
  2. For $M_i$, $i = p, p - 1, \ldots, 1$:
    1. Consider all $i$ models that contain all but one of the predictors in $M_i$ (i.e. $i - 1$ predictors each)
    2. Pick the one with the smallest RSS / largest $R^2$ and fix it as $M_{i-1}$
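An illustrative sketch of backward stepwise selection, again reusing the `rss` helper; note it needs $n > p$ so the full model can be fit:

```python
def backward_stepwise(X, y):
    """Greedily drop one predictor at a time, keeping the best RSS at each step."""
    n, p = X.shape
    selected = list(range(p))               # M_p: full model
    path = [tuple(selected)]
    for i in range(p, 0, -1):
        # try dropping each of the i current predictors and keep the best removal
        best_drop = min(
            selected,
            key=lambda j: rss(X[:, [k for k in selected if k != j]], y),
        )
        selected.remove(best_drop)
        path.append(tuple(selected))        # M_{i-1}
    return path
```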

After subset selection we have $p + 1$ candidate models $M_0, M_1, \ldots, M_p$, all chosen on training data. If we have a test set, we can compute the test MSE of each model and select the one with the lowest test error. If not, we can select indirectly without test data, using cross-validated prediction error, $C_p$, $AIC$, $BIC$, or adjusted $R^2$.
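As one illustration of selecting without test data, here is a sketch that scores the candidate models by adjusted $R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}$, where $d$ is the number of predictors; it reuses the illustrative `rss` helper above, and the function names are hypothetical:

```python
def adjusted_r2(X, y, cols):
    """Adjusted R^2 of the model using the predictor columns in `cols`."""
    n, d = len(y), len(cols)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - (rss(X[:, list(cols)], y) / (n - d - 1)) / (tss / (n - 1))

def select_by_adjusted_r2(X, y, candidates):
    """Pick the model (tuple of column indices) with the largest adjusted R^2.

    `candidates` is any iterable of column-index tuples, e.g. the path
    returned by the forward stepwise sketch above.
    """
    return max(candidates, key=lambda cols: adjusted_r2(X, y, cols))
```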

Pros and Cons

Best Subset Selection:

  • can be used for other types of models (e.g. logistic regression)
  • cannot be used when $n < p$
  • computationally expensive ($2^p$ models to fit)

Forward Stepwise Selection:

  • less computation
  • can be used in the high-dimensional setting with $n < p$
  • may not find the best possible model since the search is greedy

Backward Stepwise Selection:

  • less computation
  • can only be used when $n > p$ (the full model must be fit first)
  • may not find the best possible model since the search is greedy

Gradient Descent

We can also fit the coefficients with gradient descent, using the update $\hat\beta^{(k+1)} = \hat\beta^{(k)} - \alpha \sum_{i=1}^{n}\left[-y_i + x_i^T\hat\beta^{(k)}\right]x_i$, where $\alpha$ is the step size.
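A minimal sketch of this update in NumPy; the step size `alpha` and iteration count are illustrative choices, not prescribed by the notes:

```python
import numpy as np

def gradient_descent(X, y, alpha=1e-3, n_iters=1000):
    """Iterate beta^{(k+1)} = beta^{(k)} - alpha * sum_i [-y_i + x_i^T beta^{(k)}] x_i."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y)     # sum_i (x_i^T beta - y_i) x_i
        beta = beta - alpha * grad
    return beta
```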